AUTOMATIC THREE DIMENSION (3-D)
WORD ALIGNMENT APPROACH


Abdel Alnasser ALASFOUR, Stefan TRAUSAN-MATU


Abstract. Building a parallel aligned corpus requires a massive effort, so a tool
for automatic alignment is useful for natural language processing in general and for
information retrieval in particular. In this paper we present a new approach that
combines most of the known alignment techniques to achieve high precision and accuracy
without human intervention. Following the Pareto principle, a list of the most common
English words was used as the anchor list.
1. Introduction


Parallel corpora are now among the most important resources for multilingual
natural language processing, including machine learning, information retrieval,
and machine translation systems [2]. Many large-scale corpora are available
offline and online on the Web. Our concern was to build a suitable framework for
developing an alignment tool that can build any parallel aligned corpus in general,
and an Arabic-English parallel corpus in particular. The framework we created uses
the functions and procedures available in "Oracle Text" [1].


Our algorithms were developed to be applied directly to any target corpus stored
in database tables, which gives us the ability to manipulate, analyze, and evaluate
the results for better accuracy. To build such a tool we started by investigating
the latest methodologies and approaches in the field of bi-text alignment. The next
sections describe each step in further detail. We start by teaching our system the
most frequently used English words, keeping in mind the Pareto principle [14], also
known as Pareto's law, which says: "For many events, roughly 80% of the effects
come from 20% of the causes".


Therefore, a list of 1000 common English words was translated to Arabic to serve
as the initial seed for our bilingual dictionary. This was very useful for developing
our alignment tool, so that we can align any parallel corpus in the near future.
These techniques can be used to align any other parallel corpus by creating a list
of the most used words in both languages, which facilitates building alignment
links for the desired parallel corpus. Building a parallel corpus at the word level
needs a massive implementation effort: finding suitable, well-translated text files,
segmentation, tokenization, stemming, sentence alignment, phrase alignment, word
alignment, mapping between the two texts, and finally creating the parallel aligned
corpus. In the next section we discuss parallel corpora in general. In the "OraLign"
section we describe the methodology we followed to align text.


2. Related Work 


The main idea of a parallel corpus is a text in language "A" placed alongside its
translation in another language "B"; that is, as much parallel text as possible is
collected and set up in one huge file known as a parallel corpus [2]. This "parallel
corpus" file must be usable in linguistic research domains such as information
retrieval, machine translation, and many other applications in the field of natural
language processing [6, 7]. The most important process in building a parallel corpus
is the "alignment", which is the mapping between the two opposite texts at several
levels: paragraph, sentence, and word. There are many techniques for bilingual
corpora alignment [6]. These methods can be grouped into three main categories:


• Statistical approaches


Two major statistical approaches have been introduced, and both are length-based.
The approach by Brown and Lai counts the words in each sentence before building
any alignment link [6]. The approach suggested by Gale and Church instead depends
on the number of characters in the two opposite sentences before creating any
alignment link [7, 8].


• Lexical approaches


Most of the alignment techniques of this type depend on lexical sources such as
bilingual dictionaries and grammar rules.


• Hybrid approaches

A combination of statistical and lexical approaches can be used to achieve
bilingual corpus alignment.

3. OraLign 


Most of the alignment approaches have been applied to many bilingual corpora, and
they have been evaluated and found successful in many applications. Our main concern
was to build an alignment tool, "OraLign", for aligning Arabic-English bi-text. For
the Arabic language, length-based approaches are not the optimal choice due to:


1- Arabic structure of text. 

2- Arabic characters type. 

3- Grammatical differences between Arabic and English. 
4- Arabic rhetoric and syntax. 


These differences sometimes lead to one-to-many sentence alignment gaps or to
blank alignment problems. On the other hand, depending only on lexical approaches
will not give the expected results because of difficulties such as finding a
suitable bilingual dictionary. To create the OraLign tool we applied a new technique
that combines many of the known alignment techniques with additions and
modifications.


OraLign, as described in the next sections, is intended to be a language-independent
word alignment tool. It depends mainly on an initial bilingual dictionary as a
lexical anchor and on a new statistical approach called the 3-dimension technique.
Figure 1 represents the OraLign procedure and Figure 2 shows the OraLign
three-dimension word alignment approach, where token_text is the word or token in
the document/sentence, token_first is the ID number of the first document/sentence
where that token occurs, token_last is the ID number of the last document/sentence
where that token appears, and token_count is the number of times that token appears
in all the documents/sentences of the corpus.


[Figure 1: example token tables for an English and an Arabic document, each with the
columns Token_Text, Token_First, Token_Last, and Token_Count. The English tokens
"Strike" and "Submarine" (3, 3, 1) are linked to the Arabic tokens that carry the
same attribute values.]


Fig. 1. OraLign Main procedure description with an example.
[Figure 2: possible alignment links between the tokens of the two texts.]


Fig. 2. OraLign 3-dimension approach.
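
The three attributes described above can be illustrated with a minimal Python
sketch. It is not the OraLign implementation, which relies on Oracle Text; the
function name `token_attributes` and the sample sentences are assumptions made
only for illustration.

```python
def token_attributes(sentences):
    """Return {token: (token_first, token_last, token_count)}.

    `sentences` is a list of token lists; sentence IDs start at 1.
    """
    attrs = {}
    for sent_id, tokens in enumerate(sentences, start=1):
        for tok in tokens:
            if tok not in attrs:
                # First time the token is seen: first == last, count == 1.
                attrs[tok] = (sent_id, sent_id, 1)
            else:
                first, _, count = attrs[tok]
                attrs[tok] = (first, sent_id, count + 1)
    return attrs


# Hypothetical four-sentence document, already tokenized.
english = [["STRIKE", "SHIP"], ["STRIKE"], ["SUBMARINE"], ["STRIKE"]]
attrs = token_attributes(english)
print(attrs["STRIKE"], attrs["SUBMARINE"])
```

In this toy input, STRIKE first occurs in sentence 1, last occurs in sentence 4,
and occurs three times, while SUBMARINE occurs once in sentence 3. Two tokens from
opposite texts become alignment candidates when these three values coincide.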


4. Oracle text 


[Figure 3: processing pipeline from the language text through the lexer and indexing
to the data store and analysis stage, supported by the bilingual list of common
words in both texts.]


Fig. 3. OraLign implementation process.
Oracle was our first option as a basis for our framework, especially "Oracle Text",
which offers a complete text search solution [1]. Another Oracle tool, Developer 6i,
was used to create the GUI for the OraLign tool; both run in the Windows OS
environment. Our implementation can easily be applied in many different environments
(MS Access, MySQL). Figure 3 shows the implementation process and all the
sub-processes, which are described in more detail in the next sections.


5. OraLign framework 


To build our framework we decided to include lexical information as anchor points:
a list of the most commonly used words in English, which we translated to Arabic
using a dictionary. This small dictionary, the "initial dictionary", is the anchor
list on which our alignment algorithm is established. The next sub-sections describe
OraLign in more detail.


A. Loading documents 


Since Oracle supports the processing of many kinds of documents (PDF, DOC, text,
etc.) with extensive support for most languages, the user only needs to set up and
configure the appropriate database tables to store these documents.


B. Lexer 


The main purpose of the lexer step is to split the documents into tokens according
to the specified language of the document and the configuration parameters set for
that language. These include the declaration of the sentence borders ('.', '?', '!'),
whitespace (' ') as the separation character between document words, and any other
language-specific settings [1]. The output of this process is the raw data of
document tokens. Figure 4 shows a sample of document tokens after lexer processing.


{ REPORT }, { USA }, { RESULT }, { STATEMENT }, { DAMAGE }, { INJURY },
{ OCCURRENCE }, { ESTABLISH }, { FIVE }, { WITHOUT }, { MAJOR }, { COLLIDE },
{ GULF }, { PERISCOPE }, { SUBMARINE }, { FLEET }, { SHIP }, { IDENTITY }


Fig. 4. English document token list.
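
The lexer step can be approximated with a short Python sketch. It is only an
illustration under stated assumptions: the real tool uses the Oracle Text lexer,
the function name `lex` and the small stop-word set are hypothetical, and no
stemming is performed here, although the token list of Figure 4 shows base forms
such as COLLIDE.

```python
import re

# Sentence borders as declared for the lexer: '.', '?', '!'.
SENTENCE_BORDERS = r"[.?!]"
# Small illustrative stop-word set (an assumption, not the tool's list).
STOP_WORDS = {"IN", "ON", "TO", "THE", "THIS", "A", "OF", "WITH", "AND", "OR"}


def lex(document, stop_words=STOP_WORDS):
    """Split a document into sentences, then into upper-case tokens."""
    sentences = []
    for raw in re.split(SENTENCE_BORDERS, document):
        # \w+ keeps runs of letters and digits and drops punctuation.
        tokens = [w.upper() for w in re.findall(r"\w+", raw)]
        tokens = [t for t in tokens if t not in stop_words]
        if tokens:
            sentences.append(tokens)
    return sentences


doc = "The submarine surfaced. The ship continued on the same course!"
print(lex(doc))
```

Splitting on '.' also breaks numbers such as "5.00", so a production lexer needs
more careful border rules than this sketch.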

C. Indexing 

Oracle supports many types of indexes. For our purpose we used the CONTEXT index
type to maximize the ability to search for and locate any token no matter how large
the documents are. Since the documents are stored in database tables, it was easy
to select the most appropriate index option. The CONTEXT indexing process creates
several auxiliary tables [1]. One of these is the "I" or "Token List" table, which
contains all the document tokens as rows and has many useful attributes.


The token list table "I" also contains information for linking the tokens to their
source document. The CONTEXT index supports most known languages, especially
English, and it also supports Arabic given some care and a suitable configuration.
Figure 5 shows a sample of the OraLign token list table and its main attributes,
which are:


1- Token text. 

2- Token first: the ID number of the sentence /document in which the token 
appears for the first time. 

3- Token last: the ID number of the sentence/document in which the token 
appears for the last time. 

4- Token count: how many times that token appears in the document(s). 


TOKEN_TEXT | TOKEN_FIRST | TOKEN_LAST | TOKEN_COUNT | WORD_ORDER
IDENTITY   | 1           | 1          | 1           | 020
ESTABLISH  | 1           | 1          | 1           | 023
WITHOUT    | 1           | 4          | 2           | 024
RESULT     | 1           | 1          | 1           | 025
DAMAGE     | 1           | 5          | 3           | 028
OCCURRENCE | 1           | 1          | 1           | 031


Fig. 5. Modified token list with its main attributes.
D. Bilingual common words dictionary 


This dictionary contains 1000 of the most commonly used English words, collected
and translated directly to Arabic. We used this list to train our algorithm.
Figure 6 shows a sample taken from the initial bilingual dictionary used in our
framework.


[Figure 6: two-column table (ENGLISH, ARABIC) listing the sample entries WATER, THAN,
CALL, FIRST, WHO, and MAY with their Arabic translations.]


Fig. 6. Sample of "1000" startup dictionary.


In this step a reference table maps each token that appears in both the startup
dictionary table and the token list table. In other words, if any document token is
found in the startup dictionary, a reference link is created and saved in a table.


E. OraLign Statistical model 


Depending on the output of each process, a statistical model is initialized to
analyze each token's properties and to check for any ambiguity before a possible
alignment link is created [11, 12].
Figure 7 shows a situation where two tokens have the same values for token first,
token last, and token count, so a possible alignment link could be created between
them. Before creating this link, the statistical process checks all the tokens in
both texts for any other tokens that have the same attribute values. If the model
finds such tokens, it checks the startup dictionary for the meaning of the two
candidate tokens in both languages. If both tokens exist in the dictionary and
their dictionary values match, a link is created; if not, the system performs a
second cycle after removing all the tokens that have already been linked.
[Figure 7: decision flow for a token from text A and a token from text B. (1) Do the
tokens have the same attribute values? (2) Do any other tokens have the same
attribute values? If not, create a link between the two tokens. (3) Otherwise, do
the tokens exist in the bilingual dictionary? (4) Do the tokens have each other's
meaning? If so, create a link; otherwise continue with the next two tokens.]


Fig. 7. English document token list.
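
The check described above can be sketched for a single candidate pair as follows.
This is a hedged illustration rather than the authors' code: the function name
`decide_link`, the return labels, and the Arabic sample words are assumptions,
and the attribute tuples are (token_first, token_last, token_count).

```python
def decide_link(en_tok, ar_tok, en_attrs, ar_attrs, dictionary):
    """Return 'link', 'defer' or 'no-match' for one candidate pair."""
    signature = en_attrs[en_tok]
    if ar_attrs[ar_tok] != signature:
        return "no-match"
    # Look for other tokens that share the same attribute values.
    en_rivals = [t for t, s in en_attrs.items() if s == signature and t != en_tok]
    ar_rivals = [t for t, s in ar_attrs.items() if s == signature and t != ar_tok]
    if not en_rivals and not ar_rivals:
        return "link"  # unambiguous on both sides
    if dictionary.get(en_tok) == ar_tok:
        return "link"  # ambiguity resolved by the startup dictionary
    return "defer"     # retry in a later cycle, after linked tokens are removed


# Hypothetical attribute tables and a one-entry startup dictionary.
en_attrs = {"SUBMARINE": (1, 5, 4), "GULF": (1, 2, 2), "FIVE": (1, 2, 2)}
ar_attrs = {"غواصة": (1, 5, 4), "خليج": (1, 2, 2), "خمسة": (1, 2, 2)}
dictionary = {"FIVE": "خمسة"}

print(decide_link("SUBMARINE", "غواصة", en_attrs, ar_attrs, dictionary))
print(decide_link("GULF", "خليج", en_attrs, ar_attrs, dictionary))
```

A pair that returns 'defer' stays in the token lists and is reconsidered once the
tokens linked in the current cycle have been removed.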
F. Alignment Process 


After all the previous steps have taken place, the alignment process starts as
shown in the alignment procedure. Figure 8 shows two parallel texts, in English and
Arabic, that have been loaded into the OraLign tables. Figures 9 and 10 show the
token lists for both documents after they have been loaded and indexed. The
alignment process, which includes several sub-steps, starts by checking whether
both documents/sentences contain any tokens from the startup dictionary, any named
entities, and any similar words [11]. The next sections describe each sub-process
in detail.


English

The periscope of a US submarine collided in the Gulf the day before yesterday,
Thursday, with a ship whose identity was not established, without resulting in major
damage or the occurrence of injuries, as the USA Fifth Fleet reported in a statement.
The statement advised that the submarine Jacksonville, which is of the class Los
Angeles, struck a ship during an operation in the Gulf, at 5.00 in the morning of
Thursday, local time (2.00 GMT).
The statement added that the submarine surfaced following the incident, to ascertain
whether the ship, which was not identified, had received damage or not.
But the ship continued moving on the same course and at the same speed, without
sending out a distress call.
One of the two periscopes of the submarine was damaged, but the incident did not
affect its nuclear reactor and propulsion engine.

Arabic

[Arabic translation of the same five sentences.]


Fig. 8. Two parallel texts in English and Arabic.
No. | TOKEN_TEXT   | TF | TL | TC | WO
1   | PERISCOPE    | 1  | 5  | 2  | 002
2   | USA          | 1  | 1  | 2  | 005
3   | SUBMARINE    | 1  | 5  | 4  | 006
4   | COLLIDE      | 1  | 1  | 1  | 007
5   | GULF         | 1  | 2  | 2  | 010
6   | SHIP         | 1  | 4  | 4  | 018
7   | IDENTITY     | 1  | 1  | 1  | 020
8   | ESTABLISH    | 1  | 1  | 1  | 023
9   | WITHOUT      | 1  | 4  | 2  | 024
10  | RESULT       | 1  | 1  | 1  | 025
11  | MAJOR        | 1  | 1  | 1  | 027
12  | DAMAGE       | 1  | 5  | 3  | 028
13  | OCCURRENCE   | 1  | 1  | 1  | 031
14  | INJURY       | 1  | 1  | 1  | 033
15  | FIVE         | 1  | 2  | 2  | 037
16  | FLEET        | 1  | 2  | 2  | 038
17  | REPORT       | 1  | 1  | 1  | 039
18  | STATEMENT    | 1  | 3  | 3  | 042
19  | ADVISE       | 2  | 2  | 1  | 003
20  | JACKSONVILLE | 2  | 2  | 1  | 007
21  | CLASS        | 2  | 2  | 1  | 012
22  | LOS          | 2  | 2  | 1  | 013
23  | ANGELES      | 2  | 2  | 1  | 014
24  | STRIKE       | 2  | 2  | 1  | 015
25  | DURING       | 2  | 2  | 1  | 018
26  | OPERATION    | 2  | 2  | 1  | 020
27  | MORNING      | 2  | 2  | 1  | 028
28  | LOCAL        | 2  | 2  | 1  | 031
29  | TIME         | 2  | 2  | 1  | 032
30  | GREENWICH    | 2  | 2  | 1  | 034
31  | ADD          | 3  | 3  | 1  | 003
32  | SURFACE      | 3  | 3  | 1  | 007
33  | FOLLOWING    | 3  | 3  | 1  | 008
34  | INCIDENT     | 3  | 5  | 2  | 010
35  | ASCERTAIN    | 3  | 3  | 1  | 012
36  | IDENTIFY     | 3  | 3  | 1  | 019
37  | RECEIVE      | 3  | 3  | 1  | 021
38  | CONTINUE     | 4  | 4  | 1  | 004
39  | MOVING       | 4  | 4  | 1  | 005
40  | SAME         | 4  | 4  | 4  | 008
41  | COURSE       | 4  | 4  | 1  | 009
42  | SPEED        | 4  | 4  | 1  | 014
43  | SEND         | 4  | 4  | 1  | 016
44  | OUT          | 4  | 4  | 1  | 017
45  | DISTRESS     | 4  | 4  | 1  | 019
46  | CALL         | 4  | 4  | 1  | 020
47  | AFFECT       | 5  | 5  | 1  | 016
48  | NUCLEAR      | 5  | 5  | 1  | 018
49  | REACTOR      | 5  | 5  | 1  | 019
50  | PROPULSION   | 5  | 5  | 1  | 021
51  | ENGINE       | 5  | 5  | 1  | 022


Fig. 9. English document token list.


[Figure 10: the Arabic token list, with the same columns TOKEN_TEXT, TF, TL, TC,
and WO.]


Fig. 10. Arabic document token list.
1- Start-up dictionary:


Figure 11 shows the tokens that have been translated using the startup dictionary.
For better accuracy the system must make sure that both sides of a dictionary entry
exist in the opposite documents, to avoid mistranslation errors. In other words, if
an English word is found in the dictionary but its translation does not exist in
the Arabic document, the entry is ignored. In our example OraLign found 12 tokens
that are ready to be linked to each other.


No. | EN_TOKEN | EN_TF | EN_W_ORDER | AR_TF | AR_W_ORDER
1   | SHIP     | 1     | 018        | 1     | 011
2   | MAJOR    | 1     | 027        | 1     | 020
3   | CLASS    | 2     | 012        | 2     | 024
4   | DURING   | 2     | 018        | 2     | 008
5   | MORNING  | 2     | 028        | 2     | 014
6   | TIME     | 2     | 032        | 2     | 016
7   | ADD      | 3     | 003        | 3     | 002
8   | CONTINUE | 4     | 004        | 4     | 003
9   | SAME     | 4     | 008        | 4     | 007
10  | SPEED    | 4     | 014        | 4     | 003
11  | CALL     | 4     | 020        | 4     | 014
12  | ENGINE   | 5     | 022        | 5     | 014

(The AR_TOKEN column of the figure holds the Arabic translation of each English
token.)


Fig. 11. List of tokens found in the dictionary.
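
The rule that both sides of a dictionary entry must be present can be sketched in
a few lines of Python. The function name `dictionary_links` and the sample entries
are assumptions chosen for illustration; they are not taken from the startup
dictionary used in the paper.

```python
def dictionary_links(en_tokens, ar_tokens, dictionary):
    """Link English tokens to Arabic tokens through the startup dictionary.

    An entry is used only when the English word is in the English document
    and its translation is in the Arabic document.
    """
    ar_set = set(ar_tokens)
    links = []
    for en in en_tokens:
        ar = dictionary.get(en)
        if ar is not None and ar in ar_set:
            links.append((en, ar))
    return links


# Hypothetical dictionary sample and document tokens.
dictionary = {"SHIP": "سفينة", "WATER": "ماء", "TIME": "وقت"}
en_tokens = ["SHIP", "WATER", "SUBMARINE"]
ar_tokens = ["سفينة", "غواصة"]

print(dictionary_links(en_tokens, ar_tokens, dictionary))
```

Here WATER is in the dictionary, but its translation does not occur in the Arabic
token list, so no link is created for it.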
2- Named Entity Recognition 


In many cases the Arabic document contains named entities for persons and places
[4]. When these names are translated from English to Arabic or vice versa, they are
written in the characters of the target language following the pronunciation of the
name in the source language. For example, the country name "Romania" is written in
Arabic as "رومانيا", which has the same pronunciation as in English. For that reason
we created a special procedure that extracts the named entities from the Arabic
document and then compares them with those in the English document [13, 14].
Figure 12 shows the named entities that have been found in both documents.


TOKEN_TEXT   | EN_TF | EN_TL | EN_TC | EN_W | AR_TF | AR_TL | AR_TC | AR_W | JWS
JACKSONVILLE | 2     | 2     | 1     | 007  | 2     | 2     | 1     | 027  | 88
LOS          | 2     | 2     | 1     | 013  | 2     | 2     | 1     | 025  | 81
ANGELES      | 2     | 2     | 1     | 014  | 2     | 2     | 1     | 026  | 82
GREENWICH    | 2     | 2     | 1     | 034  | 2     | 2     | 1     | 020  | 82

(The AR_TOKEN column of the figure holds the Arabic spelling of each name.)


Fig. 12. A list of the named entities.
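
One way to compare names across the two scripts is to romanize the Arabic token
and score its similarity to the English token. The sketch below is an assumption
about how such a comparison could look, not the procedure used in OraLign: the
transliteration table is deliberately tiny, and the ratio from Python's `difflib`
stands in for whatever similarity score the tool computes.

```python
from difflib import SequenceMatcher

# Minimal, simplified Arabic-to-Latin letter table (an illustrative assumption).
TRANSLIT = {
    "ا": "a", "ب": "b", "ت": "t", "ج": "j", "د": "d", "ر": "r", "س": "s",
    "ف": "f", "ك": "k", "ل": "l", "م": "m", "ن": "n", "و": "o", "ي": "i",
}


def romanize(word):
    """Map each known Arabic letter to a Latin letter; drop unknown ones."""
    return "".join(TRANSLIT.get(ch, "") for ch in word)


def similarity(en, ar):
    """Similarity in [0, 1] between an English token and an Arabic token."""
    return SequenceMatcher(None, en.lower(), romanize(ar)).ratio()


def match_named_entities(en_tokens, ar_tokens, threshold=0.8):
    """Pair each English token with its most similar Arabic token."""
    pairs = []
    for en in en_tokens:
        best = max(ar_tokens, key=lambda ar: similarity(en, ar), default=None)
        if best is not None and similarity(en, best) >= threshold:
            pairs.append((en, best))
    return pairs


pairs = match_named_entities(["ROMANIA", "SHIP"], ["رومانيا", "سفينة"])
print(pairs)
```

The threshold of 0.8 is arbitrary; in practice it would be tuned on names whose
transliteration is known.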
3- Similar tokens extraction 


In this step OraLign locates and extracts any similar tokens found in both
documents. Many of the Arabic documents we considered are scientific or medical
articles. Such documents contain foreign words, mainly in Latin script, that are
written exactly as they are in the original documents. As an example, Figure 13
shows two similar tokens that appear in a bi-text.
English

Mitsubishi introduces 2002 Pajero new features!

In August, Mitsubishi announced the new standard and optional equipment on the
2002 Pajero.

New standard equipment includes INVECS-II 4 automatic transmission with Sports Mode
and discharge headlamps, while front seatback pockets and illuminated vanity mirrors
with lids fitted on both sun visors are among the standard utility features.

Arabic

[Arabic translation of the same text, in which the INVECS token appears in Latin
script.]


Fig. 13. Similar tokens example.


After the three previous steps have finished, OraLign removes all the tokens linked
so far from both token list tables, and the remaining tokens are passed to the main
procedure of OraLign, the 3-D approach. The 3-D procedure runs as many cycles as
are needed to align as many tokens as possible in both documents. For that purpose
the token list is divided according to the token_first value: all the tokens that
first occur in the first document/sentence form one group (sub-list) with TF=1, and
so on. The tokens in each sub-list are sorted in descending order of the word order
column. This step is a prerequisite for OraLign to start searching for and mapping
any possible alignment tokens. Figure 14 shows an example of how the token list is
divided into sub-lists, one for each document in the corpus.
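
The division into sub-lists can be sketched as follows. The row layout
(token, TF, TL, TC, word order) and the function name `split_by_token_first` are
assumptions made for this illustration.

```python
from collections import defaultdict


def split_by_token_first(rows):
    """Group rows by TF and sort each group by word order, descending.

    Each row is a tuple (token, tf, tl, tc, word_order).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    for tf in groups:
        groups[tf].sort(key=lambda r: r[4], reverse=True)
    return dict(groups)


# A few rows taken from the English token list.
rows = [
    ("REPORT", 1, 1, 1, 39),
    ("STATEMENT", 1, 3, 3, 42),
    ("ADVISE", 2, 2, 1, 3),
    ("STRIKE", 2, 2, 1, 15),
]
sublists = split_by_token_first(rows)
print([r[0] for r in sublists[1]], [r[0] for r in sublists[2]])
```

With these rows the TF=1 sub-list starts with STATEMENT, the token with the
highest word order, and the TF=2 sub-list starts with STRIKE.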


Sub-list TF=1

No. | TOKEN_TEXT | TF | TL | TC | WO
1   | REPORT     | 1  | 1  | 1  | 039
2   | FLEET      | 1  | 1  | 1  | 038
3   | INJURY     | 1  | 1  | 1  | 033
4   | OCCURRENCE | 1  | 1  | 1  | 031
5   | RESULT     | 1  | 1  | 1  | 025
6   | ESTABLISH  | 1  | 1  | 1  | 023
7   | IDENTITY   | 1  | 1  | 1  | 020
8   | COLLIDE    | 1  | 1  | 1  | 007
9   | USA        | 1  | 1  | 2  | 005
10  | FIVE       | 1  | 2  | 2  | 037
11  | GULF       | 1  | 2  | 2  | 010
12  | STATEMENT  | 1  | 3  | 3  | 042
13  | WITHOUT    | 1  | 4  | 2  | 024
14  | PERISCOPE  | 1  | 5  | 2  | 002
15  | DAMAGE     | 1  | 5  | 3  | 028
16  | SUBMARINE  | 1  | 5  | 4  | 006

Sub-list TF=2

No. | TOKEN_TEXT | TF | TL | TC | WO
1   | ADVISE     | 2  | 2  | 1  | 003
2   | STRIKE     | 2  | 2  | 1  | 015
3   | OPERATION  | 2  | 2  | 1  | 020
4   | LOCAL      | 2  | 2  | 1  | 031

Sub-list TF=3

No. | TOKEN_TEXT | TF | TL | TC | WO
1   | SURFACE    | 3  | 3  | 1  | 007
2   | FOLLOWING  | 3  | 3  | 1  | 008
3   | INCIDENT   | 3  | 5  | 2  | 010
4   | ASCERTAIN  | 3  | 3  | 1  | 012
5   | IDENTIFY   | 3  | 3  | 1  | 019
6   | RECEIVE    | 3  | 3  | 1  | 021

Sub-list TF=4

No. | TOKEN_TEXT | TF | TL | TC | WO
1   | MOVING     | 4  | 4  | 1  | 005
2   | COURSE     | 4  | 4  | 1  | 009
3   | SEND       | 4  | 4  | 1  | 016
4   | OUT        | 4  | 4  | 1  | 017
5   | DISTRESS   | 4  | 4  | 1  | 019


Fig. 14. List of tokens in sub-lists TF=1, TF=2, TF=3 and TF=4
6. Practical alignment process 


After collecting all this information, the system is ready to build any possible
alignment links between the appropriate tokens of both documents. To develop our
algorithms we applied our new "3-D" alignment approach. First, the system removes
all the translated tokens, named entities, and similar tokens from both token lists
and keeps all the other tokens, which still need to be mapped and linked [10]. In
our example, the English document contains 50 tokens; 12 of them are linked through
the start-up dictionary and 4 as named entities (no tokens are in the similar-token
list), and the remaining tokens still need to be aligned. In the next section we
demonstrate the alignment process for the remaining tokens in sentence one (TF=1)
as an example of how our algorithm works.
Sentence (TF=1) analyzing and mapping


STEP 1

No. | EN_TOKEN   | W   | TC | TL | TF | AR_TF | AR_TL | AR_TC | AR_W
1   | STATEMENT  | 042 | 3  | 3  | 1  | 1     | 3     | 3     | 030
2   | REPORT     | 039 | 1  | 1  | 1  | 1     | 2     | 2     | 028
3   | FLEET      | 038 | 1  | 1  | 1  | 1     | 1     | 1     | 026
4   | FIVE       | 037 | 2  | 2  | 1  | 1     | 2     | 2     | 025
5   | INJURY     | 033 | 1  | 1  | 1  | 1     | 1     | 1     | 023
6   | OCCURRENCE | 031 | 1  | 1  | 1  | 1     | 1     | 1     | 022
7   | DAMAGE     | 028 | 3  | 5  | 1  | 1     | 5     | 3     | 019
8   | RESULT     | 025 | 1  | 1  | 1  | 1     | 1     | 1     | 017
9   | WITHOUT    | 024 | 2  | 4  | 1  | 1     | 4     | 2     | 016
10  | ESTABLISH  | 023 | 1  | 1  | 1  | 1     | 1     | 1     | 014
11  | IDENTITY   | 020 | 1  | 1  | 1  | 1     | 1     | 1     | 013
12  | GULF       | 010 | 2  | 2  | 1  | 1     | 2     | 2     | 006
13  | COLLIDE    | 007 | 1  | 1  | 1  | 1     | 1     | 2     | 004
14  | SUBMARINE  | 006 | 4  | 5  | 1  | 1     | 5     | 4     | 003
15  | USA        | 005 | 2  | 1  | 1  | 1     | 5     | 2     | 002
16  | PERISCOPE  | 002 | 2  | 5  | 1  | 1     | 2     | 2     | 001

(Each row places the English token and the Arabic token of the same rank side by
side; the Arabic token text column of the figure is omitted.)

[Tree nodes drawn with a recursive arrow, labeled TF:TL:TC: 1:3:3, 1:5:3, 1:5:4,
1:5:2.]


Fig. 15. Tokens tree for TF=1.
In this step the system builds a tree in which each token is represented as a
circle: a token with more than one occurrence is drawn as a circle with a recursive
arrow, and a token with occurrence 1 is drawn as a simple circle. Any circle with
an arrow (occurrence greater than one) can be linked to any other circle in a
father relation, but a simple circle cannot be connected to any other token in the
same list.


STEP 2 


After removing the tokens matched in step (1), the system divides the tokens in
TF=1 into several parts. The borders of each part are the positions of the removed
tokens; see Figure 16.


[Figure 16: the English and Arabic token lists for TF=1 from Figure 15, divided into
parts at the positions of the tokens removed in step 1.]


Fig. 16. List of tokens in sentence "1", TF=1.
Before building such a tree, the system sorts the tokens in descending order of
their position in the sentence. OraLign then starts building the tree from the
token with the highest order (the last word) in the sentence. Figure 15 shows the
tree, its arrows, and the way it is created. For the English text the tree and its
branches are created from left to right, the same direction in which English text
is read. The tree for the Arabic document is created from right to left, the same
direction in which Arabic text is read. After setting up the trees for both
documents, the system compares the token attribute values from top to bottom
according to the values of TF, TL, and TC, and then builds a link between tokens
whose values match. If there is any ambiguity in the tree caused by several tokens
having the same values for TF, TL, and TC, the system passes to the next token.
When the first cycle is finished, the system divides the sub-list into further
lists after removing the tokens that were already linked to each other in the
first cycle.
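
The first cycle can be sketched as linking only those tokens whose (TF, TL, TC)
triple is unique on both sides and skipping every ambiguous triple. The sketch
below uses attribute values from the TF=1 lists; the Arabic tokens are represented
by placeholder identifiers built from their word-order values, and the function
names are assumptions for illustration only.

```python
from collections import defaultdict


def signature_index(attrs):
    """Map each (TF, TL, TC) triple to the tokens that carry it."""
    index = defaultdict(list)
    for token, signature in attrs.items():
        index[signature].append(token)
    return index


def link_unique_signatures(en_attrs, ar_attrs):
    """One cycle: link tokens whose triple is unique in both token lists."""
    en_index = signature_index(en_attrs)
    ar_index = signature_index(ar_attrs)
    links = []
    for signature, en_tokens in en_index.items():
        ar_tokens = ar_index.get(signature, [])
        if len(en_tokens) == 1 and len(ar_tokens) == 1:
            links.append((en_tokens[0], ar_tokens[0]))
        # Otherwise the triple is ambiguous or unmatched: pass to the next one.
    return links


en_attrs = {
    "STATEMENT": (1, 3, 3), "DAMAGE": (1, 5, 3), "WITHOUT": (1, 4, 2),
    "SUBMARINE": (1, 5, 4), "USA": (1, 1, 2), "PERISCOPE": (1, 5, 2),
    "GULF": (1, 2, 2), "FIVE": (1, 2, 2),
}
ar_attrs = {
    "ar_030": (1, 3, 3), "ar_019": (1, 5, 3), "ar_016": (1, 4, 2),
    "ar_003": (1, 5, 4), "ar_004": (1, 1, 2), "ar_002": (1, 5, 2),
    "ar_006": (1, 2, 2), "ar_025": (1, 2, 2), "ar_028": (1, 2, 2),
    "ar_001": (1, 2, 2),
}
links = link_unique_signatures(en_attrs, ar_attrs)
print(links)
```

On this input the six tokens with unique triples are linked, while GULF and FIVE
share the triple (1, 2, 2) with several Arabic tokens and are left for the later
steps.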


No. | EN_TOKEN   | W   | TC | TL | TF | AR_TF | AR_TL | AR_TC | AR_W
1   | REPORT     | 039 | 1  | 1  | 1  | 1     | 2     | 2     | 028
2   | FLEET      | 038 | 1  | 1  | 1  | 1     | 1     | 1     | 026
3   | FIVE       | 037 | 2  | 2  | 1  | 1     | 2     | 2     | 025
4   | INJURY     | 033 | 1  | 1  | 1  | 1     | 1     | 1     | 023
5   | OCCURRENCE | 031 | 1  | 1  | 1  | 1     | 1     | 1     | 022


Fig. 17. Tokens tree for TF=1, Part=1.
STEP 3


The first part of TF=1 is mapped from top to bottom and from the right side of the
Arabic tokens to the left side of the English tokens, as shown in Figure 17.


STEP 4


In this step a direct link is built since there is just one token on each side; see
Figure 18.


[Figure 18: the single remaining pair of part 2, the English token RESULT and one
Arabic token, each with TF=1, TL=1, and TC=1.]


Fig. 18. Tokens tree for TF=1, Part=2.
STEP 5 


In the final step for TF=1, shown in Figure 19, there is an Arabic token with an
occurrence value greater than one that has been linked to an English token with an
occurrence value equal to one. In this case the system keeps that Arabic token in
memory and moves a copy of it to the next sentence, to try to find a corresponding
English token in that sentence.


[Figure 19: part 3 of TF=1, containing the English tokens ESTABLISH (W=023),
IDENTITY (W=020), GULF (W=010), and COLLIDE (W=007) and the corresponding Arabic
tokens; an Arabic token with more than one occurrence is carried over to the next
level(s).]


Fig. 19. Tokens tree for TF=1, Part=3.

TF = 1 alignment outcomes


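EN_TOKEN  | EN_W | EN_TC | EN_TL | EN_TF | AR_TF | AR_TL | AR_TC | AR_W
PERISCOPE | 002  | 2     | 5     | 1     | 1     | 5     | 2     | 002
USA       | 005  | 2     | 1     | 1     | 1     | 1     | 2     | 004
SUBMARINE | 006  | 4     | 5     | 1     | 1     | 5     | 4     | 003
WITHOUT   | 024  | 2     | 4     | 1     | 1     | 4     | 2     | 016
DAMAGE    | 028  | 3     | 5     | 1     | 1     | 5     | 3     | 019
STATEMENT | 042  | 3     | 3     | 1     | 1     | 3     | 3     | 030


Fig. 20. Tokens tree for TF=1, step=1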


• Step 3 output


EN_TOKEN   | EN_W | EN_TC | EN_TL | EN_TF | AR_TF | AR_TL | AR_TC | AR_W
OCCURRENCE | 031  | 1     | 1     | 1     | 1     | 1     | 1     | 022
INJURY     | 033  | 1     | 1     | 1     | 1     | 1     | 1     | 023
FIVE       | 037  | 2     | 2     | 1     | 1     | 2     | 2     | 028
FLEET      | 038  | 1     | 1     | 1     | 1     | 2     | 2     | 025
REPORT     | 039  | 1     | 1     | 1     | 1     | 1     | 1     | 026


Fig. 21. Tokens tree for TF=1, step=3


• Step 4 output


[Output tables for the last steps, with the columns EN_TOKEN, EN_W, EN_TC, EN_TL,
EN_TF, AR_TF, AR_TL, AR_TC, AR_W, and AR_TOKEN; the legible rows are IDENTITY
(W=020) and ESTABLISH (W=023).]


Fig. 23. Tokens tree for TF=1, step=5


7. The analysis of the OraLign results 


Table 1 and Table 2 present the details of the English and Arabic documents,
respectively. The share of the bilingual dictionary in the alignment process was
24%, and the share of the named entity extraction process was 8%, which leaves 68%
for the 3-D procedure. OraLign will give more accurate results when aligning a
large number of documents.
Table 1

Sentence | Number of Tokens | Tokens in Startup Dictionary | Named Entities | Remaining Tokens
1        | 18               | 2                            | 0              | 16
2        | 12               | 4                            | 4              | 4
3        | 7                | 1                            | 0              | 6
4        | 8                | 4                            | 0              | 4
5        | 5                | 1                            | 0              | 4
TOTAL    | 50               | 12                           | 4              | 35

Table 2

Sentence | Number of Tokens | Tokens in Startup Dictionary | Named Entities | Remaining Tokens
1        | 18               | 2                            | 0              | 16
2        | 10               | 4                            | 4              | 2
3        | 7                | 1                            | 0              | 6
4        | 8                | 4                            | 0              | 4
5        | 5                | 1                            | 0              | 4
TOTAL    | 48               | 12                           | 4              | 32


8. OraLign evaluation 


To evaluate our method we used two documents, each a translation of the other.
Both documents contain 5 sentences, and both contain many stop words, such as
"in, on, to, the, this" in the English document and their counterparts in the
Arabic document. Figures 24 and 25 show samples of the stop words that are removed
from both documents before any further steps are run.


ID | STOPW | LANG
1  | IN    | EN
2  | THE   | EN
3  | AND   | EN
4  | ON    | EN
5  | WITH  | EN
6  | A     | EN
7  | OF    | EN
8  | TO    | EN
9  | OR    | EN
10 | ALSO  | EN


Fig. 24. English stop words sample.


[Figure 25: the corresponding table of Arabic stop words, with the columns ID,
STOPW, and LANG.]


Fig. 25. Arabic stop words sample
After removing the stop words from both documents, 50 tokens remained in the
English document and 48 in the Arabic document. The maximum number of links OraLign
can build is therefore 50 alignment relations (one per English token).

• Tokens in the start-up dictionary: 12

• Tokens in the named entity list: 4

• Tokens in the OraLign list: 33


The total number of linked tokens over all three routes is 49. The maximum number
of possible links is not reached because one English token, "ADVISE", has not been
linked to any Arabic token; see Figure 26.


TOKEN_TEXT | W   | TC | TL | TF
ADVISE     | 003 | 1  | 2  | 2


Fig. 26. English tokens not aligned.


On the other hand, one Arabic token has been linked with two different English
words from the English list, and both links are correct; see Figure 27:


EN_TOKEN | EN_W | EN_TC | EN_TL | EN_TF | AR_TF | AR_TL | AR_TC | AR_W
COLLIDE  | 007  | 1     | 1     | 1     | 1     | 2     | 2     | 001
STRIKE   | 015  | 1     | 2     | 2     | 1     | 2     | 2     | 001


Fig. 27. Same Arabic token linked with two different English tokens.


Based on the output of all the previous steps, we can evaluate our algorithms by
calculating the precision, recall, and f-measure (f-score) to check the accuracy
and error rate of our method [3, 4].


Precision and recall are the best-known basic measures for evaluating how well a
specific relevant item is found within a huge list of items [3, 4]. In more detail,
recall is used to "calculate the ratio of the number of relevant records retrieved
to the total number of relevant records in the database". On the other hand,
precision is used to "measure the ratio of number of relevant records retrieved to
the total number of irrelevant and relevant records retrieved". Both recall and
precision are usually expressed as a percentage. Figure 28 describes recall and
precision for any information retrieval system in general.


In our case the total number of records (tokens) is 50. The number of tokens that
have been linked is 49, and the number of correct relations is 43. We present our
results in suitable variables:


• Number of relevant tokens linked: 43
• Number of relevant tokens not linked: 1
• Number of irrelevant tokens retrieved: 7
[Figure 28: diagram of three sets of items: A = relevant items retrieved,
B = relevant items not retrieved, C = irrelevant items retrieved.]


$$\text{Recall} = \frac{A}{A+B} \times 100\% \qquad \text{Precision} = \frac{A}{A+C} \times 100\%$$

Fig. 28. Recall and Precision descriptions and Formulas.


Since we know that 6 of the relations are not correct, we can compute A, B, and C:


A = 49 - 6 = 43, B = 50 - 43 = 7, C = 49 - 43 = 6.


From the above values we can compute recall and precision, respectively:


• Recall = A/(A+B) = 43/(43+7) = 86%
• Precision = A/(A+C) = 43/(43+6) = 87.8%, taken as 87%


Fig. 29. Recall and Precision chart. 


For further evaluation we can compute the f-measure (f-score), which is normally
used to measure overall "search" accuracy from the outcomes of both recall and
precision [3]. Formulas 1 and 2 show the standard f-measure (f-score) formula and
its result.


$$f_{measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (1)$$

$$f_{measure} = 2 \times \frac{0.87 \times 0.86}{0.87 + 0.86} \approx 0.86 \qquad (2)$$


Figure 29 shows the results of Recall, Precision, and F_measure for each 
sentence. 
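
The evaluation arithmetic can be reproduced with a few lines of Python. This is a
sketch of the standard formulas applied to the counts reported above (50 tokens,
49 links, 43 correct links); the function name `evaluate` is an assumption.

```python
def evaluate(total_tokens, linked, correct):
    """Compute recall, precision and f-measure from the alignment counts."""
    a = correct                  # relevant items retrieved
    b = total_tokens - correct   # relevant items not retrieved
    c = linked - correct         # irrelevant items retrieved
    recall = a / (a + b)
    precision = a / (a + c)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


recall, precision, f_measure = evaluate(total_tokens=50, linked=49, correct=43)
print(f"recall={recall:.3f} precision={precision:.3f} f_measure={f_measure:.3f}")
```

Without intermediate rounding the precision is 43/49 (about 0.878) and the
f-measure is 86/99 (about 0.869), slightly above the values obtained when
precision is first cut to 0.87.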
9. Conclusion and future works


In this paper we have introduced a novel method for bi-text alignment at the word
level, which relies on 1000 common English words, including the stop words.


The next step was building a tool for automatic token alignment, which was
described and evaluated.


This method can be applied to any bilingual set of files (corpus).


Oracle in general was a very suitable option for planning, creating, and testing
the OraLign tool.


Furthermore, Oracle Text in particular, with its useful utilities, gives extensive
support for information retrieval.


Since the OraLign evaluation showed acceptable results in terms of recall,
precision, and f_measure [3, 9], in the near future we will try to maximize the
accuracy by training OraLign with different categories of bi-text.